12  Importing Data - Part 1

Today we will focus on the practice of importing data - better than last time.

Our framework for the workflow of data visualization is shown in Figure 12.1.

Figure 12.1: Tidyverse framework again

Acquiring and importing data is the most complicated part of this course, and of data visualization in general. This unit comes now, rather than at the beginning, because it is difficult and painful while providing little of the immediate satisfaction of a cool map or graphic. In my experience, data import and manipulation are 80% or more of the work in creating visualizations, so they need to be covered at least nominally in any course on data visualization.

12.1 Load and Install Packages

As always, we should load the packages we need to import the data. There are many specialized data import packages, but tidyverse and sf are a good start and can handle many standard tables and geospatial data files. Remember, you can confirm that a package is loaded in your R session by going to the Files, Plots, and Packages panel, clicking the Packages tab, and scrolling down to tidyverse and sf to make sure they are checked.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
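
The startup messages above are what library(tidyverse) and library(sf) print when they attach. A minimal sketch of the loading step follows. It assumes leaflet is also needed, because the map at the end of this section calls leaflet(). The names pkgs, installed, and missing_pkgs are illustrative only.

```r
# Packages used in this chapter (leaflet is an assumption based on the map code)
pkgs <- c("tidyverse", "sf", "leaflet")

# requireNamespace() reports whether a package is installed without attaching it
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Nothing is installed automatically; this only tells you what to install
  message("Not installed: ", paste(missing_pkgs, collapse = ", "),
          ". Run install.packages() on these before continuing.")
} else {
  # library() attaches each package to the current R session
  library(tidyverse)
  library(sf)
  library(leaflet)
}
```

Checking first avoids a confusing error partway through the lesson if one of the packages was never installed on your machine.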

12.2 Option 1. Point and Click Download, File Save, Read

The basic way to acquire data is the point-and-click method. The steps below walk through it.

12.2.1 Find and Download Data

Go to CalEnviroScreen

Download the zipped shapefile shown in the screenshot in Figure 12.2.

Figure 12.2: CalEnviroScreen Shapefile Location

You can skip the next step if you directly save the zip file to your working directory.

12.2.2 Move the Zipped Shapefile to the R Working Directory

By default, downloads are often placed in a Downloads directory, although you may have changed that on your local machine. If that is where your download landed, the zipped file needs to be either (a) moved to the R working directory or (b) read in place by identifying the file path of the download directory and working with it from there.

For today, I will only show option (a) because it is good data science practice to keep the data in a directory associated with the visualization.

  1. Identify the directory where the zipped shapefile was downloaded. On my machine, this is a Downloads folder which can be accessed through my web browser after the file download is complete; Figure 12.3 shows an example. The name of the file is calenviroscreen40shpf2021shp.zip.

Figure 12.3: Browser download
  2. Identify the R working directory on your machine using the getwd() function.
[1] "C:/Dev/EA078_Fall2023"
  3. Move calenviroscreen40shpf2021shp.zip from the default download directory to the R working directory. Either drag it, copy and paste it, or cut and paste it. On a Mac, use Finder; on a PC, use File Explorer.

  4. Check your Files, Plots, and Packages panel to confirm that RStudio sees the zipped file. See the example in Figure 12.4.

Figure 12.4: Files, Plots, and Packages Panel

If you see the calenviroscreen40shpf2021shp.zip in the directory on your machine, congratulations! You are a winner!
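
The move can also be scripted instead of dragged. The sketch below is one possible way to do it, not the method used in class. It assumes the browser saved the file to a Downloads folder under your home directory, and copy_into_dir() is a hypothetical helper written for this example.

```r
# Copy a file into a destination directory without overwriting an existing copy
copy_into_dir <- function(src, dest_dir = getwd()) {
  if (!file.exists(src)) {
    stop("Source file not found: ", src)
  }
  dest <- file.path(dest_dir, basename(src))
  if (file.exists(dest)) {
    message("Already present: ", dest)
    return(invisible(dest))
  }
  ok <- file.copy(from = src, to = dest)
  if (!ok) stop("Copy failed: ", src)
  invisible(dest)
}

# Assumed download location; adjust if your browser saves elsewhere.
# On Windows, "~" points at Documents in R, so USERPROFILE is used instead.
home <- if (.Platform$OS.type == "windows") {
  Sys.getenv("USERPROFILE")
} else {
  path.expand("~")
}
zip_name <- "calenviroscreen40shpf2021shp.zip"
downloaded <- file.path(home, "Downloads", zip_name)

if (file.exists(downloaded)) {
  copy_into_dir(downloaded)
}

# TRUE once the zip file is in the working directory
file.exists(zip_name)
```

Copying rather than moving leaves the original download in place, which is the safer choice while you are still learning where files live.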

12.2.3 Unzip the data - Two Ways

Although the data is in the right place, it is not directly readable while zipped.

12.2.3.1 Point and Click Unzip

I think the process is basically the same on Mac and PC, but we will confirm this in class.

  • On a Mac, double-click the .zip file. The unzipped item appears in the same folder as the .zip file.

  • On a PC, right-clicking on a zipped file will bring up a menu that includes an Extract All option. Choosing the Extract All option brings up a pathname to extract the file to. The default is to extract the zip file to a subfolder named after the zip file.

Again, go to the Files, Plots, and Packages panel and check if there is a folder called calenviroscreen40shpf2021shp, as shown in Figure 12.5.

Figure 12.5: Shapefile folder is in the working directory!

12.2.3.2 Unzip with Code

Same idea: use the unzip() function to unzip the zipped shapefile folder. We will extract it to a separate directory to test whether this approach works independently of the point-and-click method. The unzip() function needs two arguments: zipfile =, the path to the zip file, and exdir =, the directory to extract into.

directory <- 'CalEJ4'
unzip(zipfile = 'calenviroscreen40shpf2021shp.zip', exdir = directory)
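
A slightly more defensive version of the call above is sketched here. It stops with a readable message if the zip file is not where R expects it, and it returns the names of the extracted files so you can inspect them in the console. safe_unzip() is a hypothetical wrapper written for this example; only unzip(), file.exists(), and list.files() come from base R.

```r
# Unzip an archive after confirming it exists, then report what was extracted
safe_unzip <- function(zipfile, exdir) {
  if (!file.exists(zipfile)) {
    stop("Zip file not found (working directory is ", getwd(), "): ", zipfile)
  }
  # Extract everything into exdir; unzip() creates exdir if needed
  unzip(zipfile = zipfile, exdir = exdir)
  # Return the extracted file names, relative to exdir
  list.files(exdir, recursive = TRUE)
}

zip_name <- "calenviroscreen40shpf2021shp.zip"
if (file.exists(zip_name)) {
  print(safe_unzip(zip_name, exdir = "CalEJ4"))
}
```

If the archive is missing, the error message includes the current working directory, which is usually the first thing to check.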

Check the Files panel for a new CalEJ4 folder; the figure below shows how it looks on my machine.

Another folder is in the working directory!

Import the Shapefile

The sf package is used to import geospatial data. Its read_sf() function is good at reading spatial files and identifying their type.

Shapefiles are Esri's proprietary geospatial format and are very common.

The CalEnviroScreen data are in the shapefile format, which is a set of individual files kept together in one directory. In the calenviroscreen40shpf2021shp directory, there are 8 individual files with 8 different file extensions. We can ignore that detail and just point read_sf() at the directory, and it will do the rest. The dsn = argument stands for data source name, which can be a directory, a file, or a database.
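
Before reading the data, it can be useful to see those component files from the console. The sketch below lists each file in a directory alongside its extension. list_components() is a hypothetical helper, not part of sf, and the CalEJ4 directory name is taken from the unzip step above.

```r
# Tabulate the files in a directory with their file extensions
list_components <- function(dir) {
  files <- list.files(dir)
  data.frame(
    file = files,
    extension = tools::file_ext(files),
    stringsAsFactors = FALSE
  )
}

shp_dir <- "CalEJ4"
if (dir.exists(shp_dir)) {
  print(list_components(shp_dir))
}
```

Of the extensions you will see, .shp (geometry), .shx (index), and .dbf (attribute table) are the required parts of any shapefile; the rest carry supporting information such as the projection.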

CalEJ <- read_sf(dsn = directory)

Check the Environment panel after running this line of code. Is there a CalEJ object with 8035 observations of 67 variables?

If so, success is yours! Let’s make a map of Pesticide census tract percentiles to celebrate with Figure 12.6!
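
The Environment panel check can also be made from the console. The sketch below compares an object's dimensions with the expected counts. check_shape() is a hypothetical helper written for this example, and it assumes the panel's variable count matches the number of columns, geometry column included.

```r
# Compare an object's rows and columns with expected counts
check_shape <- function(x, n_obs, n_vars) {
  actual <- dim(x)
  expected <- c(as.integer(n_obs), as.integer(n_vars))
  ok <- identical(actual, expected)
  if (!ok) {
    message("Expected ", n_obs, " x ", n_vars,
            " but found ", actual[1], " x ", actual[2])
  }
  ok
}

# Only run the check if CalEJ has already been created by read_sf()
if (exists("CalEJ")) {
  check_shape(CalEJ, n_obs = 8035, n_vars = 67)
}
```

A mismatch usually means the wrong directory was read or the download was incomplete.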

12.2.4 Visualize the data

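library(leaflet)  # provides leaflet(), colorNumeric(), addTiles(), addPolygons(), addLegend()

# Keep tracts with a non-negative pesticide percentile, then reproject
# to WGS84 longitude/latitude, which leaflet expects
CalEJ <- CalEJ |>
  filter(PesticideP >= 0) |>
  st_transform("+proj=longlat +ellps=WGS84 +datum=WGS84")

# Map pesticide percentiles onto a grey color ramp
palPest <- colorNumeric(palette = 'Greys', domain = CalEJ$PesticideP)

# Draw the tract polygons over a basemap, with hover labels and a legend
leaflet(data = CalEJ) %>%
  addTiles() %>%
  addPolygons(color = ~palPest(PesticideP),
              fillOpacity = 0.5,
              weight = 2,
              label = ~ApproxLoc) %>%
  addLegend(pal = palPest,
            title = 'Pesticide (%)',
            values = ~PesticideP)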